Abstract There is a large variety of objects and appliances in human environments,
such as stoves, coffee dispensers, juice extractors, and so on. It is challenging for
a roboticist to program a robot for each of these object types and for each of their
instantiations. In this work, we present a novel approach to manipulation planning
based on the idea that many household objects share similarly-operated object parts.
We formulate manipulation planning as a structured prediction problem and design
a deep learning model that can handle large noise in the manipulation demonstrations
and learns features from three different modalities: point-clouds, language,
and trajectory. In order to collect a large number of manipulation demonstrations for
different objects, we developed a new crowd-sourcing platform called Robobarista.
We test our model on our dataset consisting of 116 objects with 249 parts along
with 250 language instructions, for which there are 1225 crowd-sourced manipulation
demonstrations. We further show that our robot can even manipulate objects it
has never seen before.

Keywords — Robotics and Learning, Crowd-sourcing, Manipulation 

1 Introduction 

Consider the espresso machine in Figure 1: even without having seen the machine
before, a person can prepare a cup of latte by visually observing the machine and by
reading a natural language instruction manual. This is possible because humans have
vast prior experience of manipulating differently-shaped objects that share common
parts such as ‘handles’ and ‘knobs’. In this work, our goal is to enable robots to
generalize their manipulation ability to novel objects and tasks (e.g. toaster, sink,
water fountain, toilet, soda dispenser). Using a large knowledge base of manipulation
demonstrations, we build an algorithm that infers an appropriate manipulation
trajectory given a point-cloud and natural language instructions.

The key idea in our work is that many objects designed for humans share many
similarly-operated object parts such as ‘handles’, ‘levers’, ‘triggers’, and ‘buttons’,
and that manipulation motions can be transferred even among completely different
objects if we represent motions with respect to object parts. For example, even
if the robot has never seen the ‘espresso machine’ before, the robot should be able
to manipulate it if it has previously seen similarly-operated parts in other objects
such as ‘urinal’, ‘soda dispenser’, and ‘restroom sink’, as illustrated in Figure 2.
Object parts that are operated in similar fashion may not carry the same part name
(e.g., ‘handle’) but would rather have some similarity in their shapes that allows
the motion to be transferred between completely different objects.

If the sole task for the robot is to manipulate one specific espresso machine or just 
a few types of ‘handles’, a roboticist could manually program the exact sequence to 
be executed. However, in human environments, there is a large variety in the types 
of object and their instances. Classification of objects or object parts (e.g. ‘handle’) 
alone does not provide enough information for robots to actually manipulate them. 
Thus, rather than relying on scene understanding techniques (7] (33] Q3, we directly 
use 3D point-cloud for manipulation planning using machine learning algorithms. 

Such machine learning algorithms require a large dataset for training. However,
collecting such a large dataset of expert demonstrations is very expensive as it requires
the joint physical presence of the robot, an expert, and the object to be manipulated.
In this work, we show that we can crowd-source the collection of manipulation
demonstrations to the public over the web through our Robobarista platform and
still outperform the model trained with expert demonstrations.

The key challenges in our problem are in designing features and a learning model
that integrates three completely different modalities of data (point-cloud, language,
and trajectory), and in handling a significant amount of noise in crowd-sourced
manipulation demonstrations. Deep learning has made an impact in related application
areas (e.g., vision EUEl, natural language processing BtI I). In this work, we present
a deep learning model that can handle large noise in labels, with a new architecture
that learns relations between the three different modalities. Furthermore, in contrast
to previous approaches based on learning from demonstration (LfD) that learn a
mapping from a state to an action a, our work complements LfD as we focus on
the entire manipulation motion (as opposed to a sequential state-action mapping).

Fig. 1: First encounter of an espresso machine by our PR2 robot.
Without ever having seen the machine before, given the language instructions
and a point-cloud from a Kinect sensor, our robot is capable
of finding appropriate manipulation trajectories from prior experience
using our deep learning model.

[Fig. 2 panels. Input: object-part point-cloud and instructions ("Hold the cup of espresso below the
hot water nozzle." "Push down on the handle to add hot water" ...). Output: trajectory for parts.
Source objects shown: urinal flush valve, soda dispenser.]

Fig. 2: Object part and natural language instructions input to manipulation trajectory as output. Objects such
as the espresso machine consist of distinct object parts, each of which requires a distinct manipulation trajectory for
manipulation. For each part of the machine, we can re-use a manipulation trajectory that was used for some other object
with similar parts. So, for an object part in a point-cloud (each object part colored on left), we can find a trajectory
used to manipulate some other object (labeled on the right) that can be transferred (labeled in the center). With this
approach, a robot can operate a new and previously unobserved object such as the ‘espresso machine’, by successfully
transferring trajectories from other completely different but previously observed objects. Note that the input point-cloud
is very noisy and incomplete (black represents missing points).

In order to validate our approach, we have collected a large dataset of 116 objects
with 250 natural language instructions for which there are 1225 crowd-sourced
manipulation trajectories from 71 non-expert users via our Robobarista web platform
(http://robobarista.cs.cornell.edu). We also present experiments on
our robot using our approach. In summary, the key contributions of this work are:

• a novel approach to manipulation planning via part-based transfer between different
objects that allows manipulation of novel objects,

• incorporation of crowd-sourcing into manipulation planning,

• introduction of a deep learning model that handles three modalities with noisy
labels from crowd-sourcing, and

• contribution of the first large manipulation dataset and experimental evaluation
on this dataset.


2 Related Work 

Scene Understanding. There has been great advancement in scene understanding
[33, 28, 63], in human activity detection [52] BTl, and in features for RGB-D images
and point-clouds EHIED. Similar to our idea of using part-based transfers,
the deformable part model 02] was effective in object detection. However, classification
of objects, object parts, or human activities alone does not provide enough
information for a robot to reliably plan manipulation. Even a simple category such
as kitchen sinks has so much variation in its instances, each differing in how it is
operated: pulling the handle upwards, pushing upwards, pushing sideways, and so
on. On the other hand, the direct perception approach skips the intermediate object
labels and directly perceives affordance based on the shape of the object lfT6l [30]. It
focuses on detecting the part known to afford a certain action such as ‘pour’ given the
object, while we focus on predicting the correct motion given the object part.

Manipulation Strategy. For highly specific tasks, many works manually sequence
different controllers to accomplish complicated tasks such as baking cookies [8] and
folding the laundry [35], or focus on learning specific motions such as grasping [26]
and opening doors m. Others focus on learning to sequence different movements
[53, 36] but assume that there exist perfect controllers such as grasp and pour.

For a more general task of manipulating new instances of objects, previous approaches
rely on finding articulation (SUED or using interaction [25], but they
are limited by the tracking performance of a vision algorithm. Many objects that humans
operate daily have small parts such as a “knob”, which leads to significant
occlusion as manipulation is demonstrated. Another approach using part-based
transfer between objects has been shown to be successful for grasping GO] El. We
extend this approach and introduce a deep learning model that enables part-based
transfer of trajectories by automatically learning relevant features. Our focus is on
the generalization of manipulation trajectory via part-based transfer using point-clouds
without knowing objects a priori and without assuming any of the sub-steps
(‘approach’, ‘grasping’, and ‘manipulation’).

Learning from Demonstration (LfD). The most successful approach for teaching 
robots tasks, such as helicopter maneuvers G1 or table tennis liTTl , has been based on 
LfD 0. Although LfD allows end users to demonstrate the task by simply taking 
the robot arms, it focuses on learning individual actions and separately relies on 
high level task composition 03 El or is often limited to previously seen objects 
ROl [39 1 . We believe that learning a single model for an action like “turning on” is 
impossible because human environment has many variations. 

Unlike learning a model from demonstration, instance-based learning EEl 
replicates one of the demonstrations. Similarly, we directly transfer one of the 
demonstrations but focus on generalizing manipulation planning to completely new 
objects, enabling robots to manipulate objects they have never seen before. 

Deep Learning. There has been great success with deep learning, especially in the
domains of vision and natural language processing (e.g. [29, 47]). In robotics, deep
learning has previously been successfully used for detecting grasps on multi-channel
input of RGB-D images [32] and for classifying terrain from long-range vision iflBll.

Deep learning can also solve multi-modal problems [38, 32] and structured problems
m. Our work builds on prior works and extends neural networks to handle
three modalities which are of completely different data types (point-cloud, language,
and trajectory) while handling lots of label noise originating from crowd-sourcing.

Crowd-sourcing. Teaching robots how to manipulate different objects has often
relied on experts BIG]. Among previous efforts to scale teaching to the crowd 0
[54] [23], Forbes et al. [15] employ a similar approach towards crowd-sourcing but
collect multiple instances of similar table-top manipulation with the same object, and
others build web-based platforms for crowd-sourcing manipulation [56, 57]. These
approaches either depend on the presence of an expert (due to required special
software) or require a real robot at a remote location. Our Robobarista platform
borrows some components of a, but works on any standard web browser with
OpenGL support and incorporates real point-clouds of various scenes.

3 Our Approach 

The intuition for our approach is that many differently-shaped objects share
similarly-operated object parts; thus, the manipulation trajectory of an object can be
transferred to a completely different object if they share similarly-operated parts. We
formulate this problem as a structured prediction problem and introduce a deep
learning model that handles three modalities of data and deals with noise in
crowd-sourced data. Then, we introduce the crowd-sourcing platform Robobarista to easily
scale the collection of manipulation demonstrations to non-experts on the web.

3.1 Problem Formulation

The goal is to learn a function $f$ that maps a given pair of point-cloud $p \in \mathcal{P}$ of
object part and language $l \in \mathcal{L}$ to a trajectory $\tau \in \mathcal{T}$ that can manipulate the object
part as described by free-form natural language $l$:

$$f : \mathcal{P} \times \mathcal{L} \rightarrow \mathcal{T}$$

Point-cloud Representation. Each instance of point-cloud $p \in \mathcal{P}$ is represented as
a set of $n$ points in three-dimensional Euclidean space where each point $(x, y, z)$ is
represented with its RGB color $(r, g, b)$: $p = \{p^{(i)}\}_{i=1}^{n} = \{(x, y, z, r, g, b)^{(i)}\}_{i=1}^{n}$. The
size of the set varies for each instance. These points are often obtained by stitching
together a sequence of sensor data from an RGBD sensor [22].

Trajectory Representation. Each trajectory $\tau \in \mathcal{T}$ is represented as a sequence
of $m$ waypoints, where each waypoint consists of gripper status $g$, translation
$(t_x, t_y, t_z)$, and rotation $(r_x, r_y, r_z, r_w)$ with respect to the origin:
$\tau = \{\tau^{(i)}\}_{i=1}^{m} = \{(g, t_x, t_y, t_z, r_x, r_y, r_z, r_w)^{(i)}\}_{i=1}^{m}$,
where $g \in \{$“open”, “closed”, “holding”$\}$. $g$ depends
on the type of the end-effector, which we have assumed to be a two-fingered gripper
like that of PR2 or Baxter. The rotation is represented as quaternions $(r_x, r_y, r_z, r_w)$
instead of the more compact Euler angles to prevent problems such as gimbal
lock ED.

Smooth Trajectory. To acquire a smooth trajectory from a waypoint-based trajectory
$\tau$, we interpolate intermediate waypoints. Translation is linearly interpolated
and the quaternion is interpolated using spherical linear interpolation (Slerp) [45].
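The interpolation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: waypoints are stored as `(gripper, (tx, ty, tz), (rx, ry, rz, rw))` tuples, the helper names `lerp`, `slerp`, and `interpolate` are hypothetical, and intermediate waypoints are assumed to inherit the gripper status of the earlier waypoint.

```python
import math


def lerp(a, b, u):
    """Linearly interpolate two translations (tuples of three floats)."""
    return tuple((1.0 - u) * x + u * y for x, y in zip(a, b))


def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (x, y, z, w)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:  # nearly parallel: fall back to normalised linear blend
        q = tuple((1.0 - u) * a + u * b for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        s1 = math.sin(u * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def interpolate(waypoints, steps):
    """Insert `steps` intermediate waypoints between consecutive waypoints.

    Assumption: intermediate waypoints keep the gripper status of the
    earlier of the two waypoints they lie between.
    """
    smooth = []
    for (g0, t0, r0), (_, t1, r1) in zip(waypoints, waypoints[1:]):
        for k in range(steps + 1):
            u = k / (steps + 1)
            smooth.append((g0, lerp(t0, t1, u), slerp(r0, r1, u)))
    smooth.append(waypoints[-1])
    return smooth
```

With two waypoints and `steps=1`, the result has three waypoints, the middle one halfway in translation and halfway along the rotation arc.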

3.2 Can transferred trajectories adapt without modification? 

Even if we have a trajectory to transfer, a conceptually transferable trajectory is not 
necessarily directly compatible if it is represented with respect to an inconsistent 
reference point. 

To make a trajectory compatible with a new situation without modifying the trajectory,
we need a representation method for trajectories, based on point-cloud information,
that allows a direct transfer of a trajectory without any modification.

Challenges. Making a trajectory compatible when transferred to a different object
or to a different instance of the same object without modification can be challenging
depending on the representation of trajectories and the variations in the location of
the object, given in point-clouds.

For robots with high degree-of-freedom arms such as PR2 or Baxter robots, trajectories
are commonly represented as a sequence of joint angles (in configuration
space) E5|. With such a representation, the robot needs to modify the trajectory for an
object with forward and inverse kinematics even for a small change in the object’s
position and orientation. Thus, trajectories in the configuration space are prone to
errors as they are realigned with the object. They can be executed without modification
only when the robot is in the exact same position and orientation with respect
to the object.

One approach that allows execution without modification is representing trajectories
with respect to the object by aligning via point-cloud registration (e.g. DU).
However, if the object is large (e.g. a stove) and has many parts (e.g. knobs and handles),
then an object-based representation is prone to errors when individual parts have
different translation and rotation. This limits the transfers to be between different
instances of the same object that is small or has a simple structure.

Lastly, it is even more challenging if two objects require similar trajectories, but
have slightly different shapes. This is made more difficult by limitations of the
point-cloud data. As shown on the left of Fig. 2, the point-cloud data, even when stitched
from multiple angles, are very noisy compared to the RGB images.

Our Solution. Transferred trajectories become compatible across different objects 
when trajectories are represented 1) in the task space rather than the configuration 
space, and 2) in the principal-axis based coordinate frame of the object part rather 
than the robot or the object. 

Trajectories can be represented in the task space by recording only the position
and orientation of the end-effector. By doing so, we can focus on the actual interaction
between the robot and the environment rather than the movement of the arm.
It is very rare that the arm configuration affects the completion of the task as long
as there is no collision. With the trajectory represented as a sequence of gripper
positions and orientations, the robot can find an arm configuration that is collision-free
with the environment using inverse kinematics.

However, representing the trajectory in task space is not enough to make transfers
compatible. It has to be in a common coordinate frame regardless of the object’s
orientation and shape. Thus, we align the negative z-axis along gravity and align the
x-axis along the principal axis of the object part using PCA [20]. With this representation,
even when the object part’s position and orientation change, the trajectory
does not need to change. The underlying assumption is that similarly operated object
parts share similar shapes, leading to a similar direction in their principal axes.
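A rough sketch of such a part-centred frame is given below. It rests on several assumptions that the text does not state: the input points are already expressed in a frame whose z-axis points up (against gravity), the frame origin is the centroid of the part, the principal axis is projected onto the horizontal plane so that it stays orthogonal to z, and the leading eigenvector is found by plain power iteration. The function names are hypothetical. Note that PCA leaves the sign of the principal axis ambiguous.

```python
import math


def principal_axis(points, iters=100):
    """Return (centroid, leading eigenvector of the covariance matrix)."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[sum((p[a] - centroid[a]) * (p[b] - centroid[b]) for p in points) / n
            for b in range(3)] for a in range(3)]
    v = [1.0, 0.7, 0.3]  # arbitrary start vector for power iteration
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(3)) for a in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate cloud (all points identical)
            break
        v = [x / norm for x in w]
    return centroid, v


def part_frame(points):
    """Build (origin, (x_axis, y_axis, z_axis)) for an object part."""
    centroid, v = principal_axis(points)
    horiz = math.hypot(v[0], v[1])
    if horiz < 1e-6:
        # Principal axis is vertical: fall back to the world x-axis (assumption).
        x_axis = (1.0, 0.0, 0.0)
    else:
        x_axis = (v[0] / horiz, v[1] / horiz, 0.0)
    z_axis = (0.0, 0.0, 1.0)  # negative z is aligned with gravity
    y_axis = (-x_axis[1], x_axis[0], 0.0)  # z cross x, right-handed frame
    return centroid, (x_axis, y_axis, z_axis)


def to_part_frame(point, origin, axes):
    """Express a world-frame point in the part frame."""
    d = [point[k] - origin[k] for k in range(3)]
    return tuple(sum(d[k] * axis[k] for k in range(3)) for axis in axes)
```

A waypoint translation stored through `to_part_frame` stays the same when the part is moved or rotated about the vertical axis, which is the property the representation relies on.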

4 Deep Learning for Manipulation Trajectory Transfer 

We use deep learning to find the most appropriate trajectory for the given point-cloud
and natural language. Deep learning is mostly used for binary or multi-class
classification or regression problems 0 with a uni-modal input. We introduce a deep
learning model that can handle three completely different modalities of point-cloud,
language, and trajectory and solve a structural problem with lots of label noise.

The original structured prediction problem ($f : \mathcal{P} \times \mathcal{L} \rightarrow \mathcal{T}$) is converted to a
binary classification problem ($f : (\mathcal{P} \times \mathcal{L}) \times \mathcal{T} \rightarrow \{0, 1\}$). Intuitively, the model
takes the input of point-cloud, language, and trajectory and outputs whether it is a
good match (label $y = 1$) or a bad match (label $y = 0$).

Model. Given an input of point-cloud, language, and trajectory, $x = ((p, l), \tau)$, as
shown at the bottom of Figure 3, the goal is to classify it as either $y = 0$ or $1$ at the top.
The first layer $h^1$ learns a separate layer of features for each modality of $x$ ($= h^0$)
[38]. The next layer learns the relations between the input $(p, l)$ and the output
$\tau$ of the original structured problem, combining two modalities at a time. The left
combines point-cloud and trajectory and the right combines language and trajectory.
The third layer $h^3$ learns the relation between these two combinations of modalities
and the final layer $y$ represents the binary label.

Every layer $h^i$ uses the rectified linear unit [65] as the activation function:
$h^i = a(W^i h^{i-1} + b^i)$ where $a(\cdot) = \max(0, \cdot)$,
with weights to be learned $W^i \in \mathbb{R}^{M \times N}$, where $M$ and $N$ represent the number of
nodes in the $(i-1)$-th and $i$-th layer respectively. Logistic regression is used in the last
layer for predicting the final label $y$. The probability that $x = ((p, l), \tau)$ is a “good
match” is computed as: $P(Y = 1 \mid x; W, b) = 1/(1 + e^{-(Wx+b)})$
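The layer structure described above can be illustrated with a small forward pass. This is only a sketch: the layer sizes, the random initialisation, and the names `relu_layer`, `build_model`, and `predict` are all hypothetical, weight rows are stored as output-by-input lists, and the logistic output is applied to the last hidden layer.

```python
import math
import random


def relu_layer(W, b, h):
    """Compute max(0, W h + b), with W given as a list of rows."""
    return [max(0.0, sum(w * x for w, x in zip(row, h)) + bi)
            for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def build_model(n_p, n_l, n_t, n_h, rng):
    return {
        "p": init_layer(n_h, n_p, rng),       # h1: point-cloud features
        "l": init_layer(n_h, n_l, rng),       # h1: language features
        "t": init_layer(n_h, n_t, rng),       # h1: trajectory features
        "pt": init_layer(n_h, 2 * n_h, rng),  # h2: point-cloud + trajectory
        "lt": init_layer(n_h, 2 * n_h, rng),  # h2: language + trajectory
        "h3": init_layer(n_h, 2 * n_h, rng),  # h3: both combinations
        "out": init_layer(1, n_h, rng),       # logistic output
    }


def predict(model, p, l, tau):
    """Return P(Y = 1 | x = ((p, l), tau)) for fixed-length input vectors."""
    h1_p = relu_layer(*model["p"], p)
    h1_l = relu_layer(*model["l"], l)
    h1_t = relu_layer(*model["t"], tau)
    h2_pt = relu_layer(*model["pt"], h1_p + h1_t)  # list concatenation
    h2_lt = relu_layer(*model["lt"], h1_l + h1_t)
    h3 = relu_layer(*model["h3"], h2_pt + h2_lt)
    W, b = model["out"]
    z = sum(w * x for w, x in zip(W[0], h3)) + b[0]
    return 1.0 / (1.0 + math.exp(-z))
```

The trajectory features feed both second-layer branches, which mirrors the idea of relating each input modality to the output of the original structured problem.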

Label Noise. When data contains lots of noisy labels (noisy trajectories $\tau$) due to
crowd-sourcing, not all crowd-sourced trajectories should be trusted as equally
appropriate, as will be shown in Sec. 7.

For every pair of input $(p, l)_i$, we have $\mathcal{T}_i = \{\tau_{i,1}, \tau_{i,2}, \ldots, \tau_{i,n_i}\}$, a set of
trajectories submitted by the crowd for $(p, l)_i$. First, the best candidate
label $\tau_i^* \in \mathcal{T}_i$ for $(p, l)_i$ is selected as one of the labels with the smallest
average trajectory distance (Sec. 5) to other labels:

$$\tau_i^* = \arg\min_{\tau \in \mathcal{T}_i} \frac{1}{n_i} \sum_{j=1}^{n_i} \Delta(\tau, \tau_{i,j})$$

We assume that at least half of the crowd tried to give a reasonable demonstration.
Thus a demonstration with the smallest average distance to all other demonstrations
must be a good demonstration.

Once we have found the most likely label $\tau_i^*$ for $(p, l)_i$, we give the label 1 (“good
match”) to $((p, l)_i, \tau_i^*)$, making it the first positive example for the binary
classification problem. Then we find more positive examples by finding other trajectories
$\tau' \in \mathcal{T}$ such that $\Delta(\tau_i^*, \tau') < t_g$, where $t_g$ is a threshold determined by the expert.
Similarly, negative examples are generated by finding trajectories $\tau' \in \mathcal{T}$ such that
the distance is above some threshold, $\Delta(\tau_i^*, \tau') > t_w$, where $t_w$ is determined by the expert, and they
are given label 0 (“bad match”).
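The selection and labelling procedure can be sketched as below. The function names are hypothetical, `dist` stands in for the trajectory distance of Sec. 5, and the average is taken over the other submissions only, which gives the same minimiser as averaging over all of them because the distance of a trajectory to itself is zero.

```python
def select_best(submitted, dist):
    """Index of the submission with the smallest average distance to the others."""
    n = len(submitted)
    averages = [
        sum(dist(submitted[i], submitted[j]) for j in range(n) if j != i) / max(n - 1, 1)
        for i in range(n)
    ]
    return min(range(n), key=averages.__getitem__)


def label_examples(best, pool, dist, t_good, t_bad):
    """Split a pool of trajectories into positive and negative examples.

    Trajectories closer than t_good to the best candidate get label 1,
    those farther than t_bad get label 0, and the rest are left unlabeled.
    """
    positives = [tau for tau in pool if dist(best, tau) < t_good]
    negatives = [tau for tau in pool if dist(best, tau) > t_bad]
    return positives, negatives
```

In a toy setting where trajectories are single numbers and the distance is the absolute difference, an outlying submission ends up among the negatives while the cluster around the best candidate becomes the positive set.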

Pre-training. We use the stacked sparse de-noising auto-encoder (SSDA) to train
weights $W^i$ and bias $b^i$ for each layer ED [65]. Training occurs layer by layer from
bottom to top, trying to reconstruct the previous layer using SSDA. To learn parameters
for layer $i$, we build an auto-encoder which takes the corrupted output $\tilde{h}^{i-1}$
(binomial noise with corruption level $p$) of the previous layer as input and minimizes
the loss function [65] with max-norm constraint [49]:

$$W^* = \arg\min_W \|\hat{h}^{i-1} - h^{i-1}\|_2^2 + \lambda \|h^i\|_1$$

where $\hat{h}^{i-1} = f(W^i h^i + b^i)$, $h^i = f(W^{iT} \tilde{h}^{i-1} + b^i)$, $\tilde{h}^{i-1} = h^{i-1} X$, and $\|W^i\|_2 \le c$.

Fine-tuning. The pre-trained neural network can be fine-tuned by minimizing the
negative log-likelihood with the stochastic gradient method with mini-batches:
$NLL = -\sum_{i=0}^{|D|} \log(P(Y = y^i \mid x^i, W, b))$. To prevent over-fitting to the training data,
we used dropout [19], which randomly drops a specified percentage of the output of
every layer.

[Fig. 3 diagram: stacked network layers over three inputs: point cloud ($p$), language ($l$), trajectory ($\tau$).]

Fig. 3: Our deep learning model for transferring manipulation
trajectory. Our model takes the input $x$ of three different
modalities (point-cloud, language, and trajectory) and outputs
$y$, whether it is a good match or bad match. It first learns features
separately ($h^1$) for each modality and then learns the relation
($h^2$) between input and output of the original structured problem.
Finally, the last hidden layer $h^3$ learns relations of all these
modalities.

Inference. Given the trained neural network, the inference step finds the trajectory $\tau$
that maximizes the output through sampling in the space of trajectories $\mathcal{T}$:

$$\arg\max_{\tau' \in \mathcal{T}} P(Y = 1 \mid x = ((p, l), \tau'); W, b)$$

Since the space of trajectories $\mathcal{T}$ is infinitely large, based on our idea that we can
transfer trajectories across objects, we only search trajectories that the model has
seen in the training phase.
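Restricting the search to training trajectories turns inference into a ranking over a finite candidate list. A minimal sketch follows, where `score` is a hypothetical callable returning the model output $P(Y = 1 \mid ((p, l), \tau'))$ and `top_k` is an illustrative convenience that the text does not mention.

```python
def infer(score, p, l, training_trajectories, top_k=1):
    """Rank candidate trajectories by model output and return the best ones.

    `score(p, l, tau)` is assumed to return the probability of a good match.
    """
    ranked = sorted(training_trajectories,
                    key=lambda tau: score(p, l, tau),
                    reverse=True)
    return ranked[:top_k]
```

The cost of this step grows linearly with the number of stored trajectories, since each candidate needs one forward pass.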

Data pre-processing. As seen in Sec. 3.1, each of the modalities $(p, l, \tau)$ can have
any length. Thus, we pre-process to make each fixed in length.

We represent point-cloud $p$ of any arbitrary length as an occupancy grid where
each cell indicates whether any point lies in the space it represents. Because point-cloud
$p$ consists of only the part of an object, which is limited in size, we can represent
$p$ using two occupancy grids of size 10 × 10 × 10 with different scales: one
with each cell representing 1 × 1 × 1 (cm) and the other with each cell representing
2.5 × 2.5 × 2.5 (cm).

Each language instruction is represented as a fixed-size bag-of-words representation
with stop words removed. Finally, for each trajectory $\tau \in \mathcal{T}$, we first compute
its smooth interpolated trajectory $\tau_s \in \mathcal{T}_s$ (Sec. 3.1), and then normalize all trajectories
$\mathcal{T}_s$ to the same length while preserving the sequence of gripper status.
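The two fixed-length encodings for point-clouds and language can be sketched as follows. Several details are assumptions: point coordinates are in centimetres relative to the grid centre, points outside the grid are dropped, the bag-of-words vector holds counts over a given vocabulary, and the function names are hypothetical.

```python
import math


def occupancy_grid(points_cm, cell_cm, size=10):
    """Binary size x size x size grid; a cell is 1 if any point falls inside it."""
    grid = [[[0] * size for _ in range(size)] for _ in range(size)]
    half = size * cell_cm / 2.0  # grid is centred on the origin
    for point in points_cm:
        idx = [int(math.floor((c + half) / cell_cm)) for c in point]
        if all(0 <= i < size for i in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


def bag_of_words(instruction, vocabulary, stop_words):
    """Fixed-size count vector over `vocabulary`, ignoring stop words."""
    tokens = [w.strip('.,!?"\'').lower() for w in instruction.split()]
    tokens = [w for w in tokens if w and w not in stop_words]
    return [tokens.count(word) for word in vocabulary]
```

Calling `occupancy_grid` twice on the same points, once with `cell_cm=1.0` and once with `cell_cm=2.5`, gives the fine and coarse grids; the coarse grid covers a 25 cm cube and therefore keeps points that fall outside the 10 cm fine grid.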


5 Loss Function for Manipulation Trajectory 

Prior metrics for trajectories consider only their translations (e.g. [27]) and not their
rotations and gripper status. We propose a new measure, which uses dynamic time
warping, for evaluating manipulation trajectories. This measure non-linearly warps
two trajectories of arbitrary lengths to produce a matching, and the cumulative distance
is computed as the sum of the costs of all matched waypoints. The strength of this measure
is that weak ordering is maintained among matched waypoints and that every
waypoint contributes to the cumulative distance.

For two trajectories of arbitrary lengths, $\tau_A = \{\tau_A^{(i)}\}_{i=1}^{m_A}$ and $\tau_B = \{\tau_B^{(i)}\}_{i=1}^{m_B}$,
we define matrix $D \in \mathbb{R}^{m_A \times m_B}$, where $D(i, j)$ is the cumulative distance of an
optimally-warped matching between trajectories up to index $i$ and $j$, respectively,
of each trajectory. The first column and the first row of $D$ are initialized as
$D(i, 1) = \sum_{k=1}^{i} c(\tau_A^{(k)}, \tau_B^{(1)}) \ \forall i \in [1, m_A]$ and
$D(1, j) = \sum_{k=1}^{j} c(\tau_A^{(1)}, \tau_B^{(k)}) \ \forall j \in [1, m_B]$, where $c$
is a local cost function between two waypoints (discussed later). The rest of $D$ is
completed using dynamic programming:
$D(i, j) = c(\tau_A^{(i)}, \tau_B^{(j)}) + \min\{D(i-1, j-1), D(i-1, j), D(i, j-1)\}$.

Given the constraint that $\tau_A^{(1)}$ is matched to $\tau_B^{(1)}$, the formulation ensures that
every waypoint contributes to the final cumulative distance $D(m_A, m_B)$. Also, given
a matched pair $(\tau_A^{(i)}, \tau_B^{(j)})$, no waypoint preceding $\tau_A^{(i)}$ is matched to a waypoint
succeeding $\tau_B^{(j)}$, encoding weak ordering.

[Fig. 4 interface labels: automated helper texts; camera move control; camera zoom control; point-cloud;
CAD model (green/movable): object part being interacted; CAD model (red/static): object part not being
interacted; interpolated waypoint; waypoint being edited; reset current demonstration; play current
demonstration; trajectory edit bar; add/remove waypoint; current manual step to demonstrate; iterate/play
all demonstrations; submit button; simulated PR2 gripper (waypoint being edited); position/rotation
control; ghosted full demonstration.]

Fig. 4: Screen-shot of Robobarista, the crowd-sourcing platform running on Chrome browser. We have built the
Robobarista platform for collecting a large number of crowd demonstrations for teaching the robot.

The pairwise cost function $c$ between matched waypoints $\tau_A^{(i)}$ and $\tau_B^{(j)}$ is defined
in terms of the following weight and distance functions:

$$w(\tau^{(i)}; \gamma) = \exp(-\gamma \cdot \|\tau^{(i)}\|_2)$$

$$d_T(\tau_A^{(i)}, \tau_B^{(j)}) = \|(t_x, t_y, t_z)_A^{(i)} - (t_x, t_y, t_z)_B^{(j)}\|_2$$

$$d_R(\tau_A^{(i)}, \tau_B^{(j)}) = \text{angle difference between } \tau_A^{(i)} \text{ and } \tau_B^{(j)}$$

$$d_G(\tau_A^{(i)}, \tau_B^{(j)}) = \mathbb{1}(g_A^{(i)} = g_B^{(j)})$$

The parameters $\alpha, \beta$ are for scaling translation and rotation errors, and gripper status
errors, respectively. $\gamma$ weighs the importance of a waypoint based on its distance to
the object part. Finally, as trajectories vary in length, we normalize $D(m_A, m_B)$ by the
number of waypoint pairs that contribute to the cumulative sum, $|D(m_A, m_B)|_{path^*}$
(i.e. the length of the optimal warping path), giving the final form:

$$\text{distance}(\tau_A, \tau_B) = \frac{D(m_A, m_B)}{|D(m_A, m_B)|_{path^*}}$$

This distance function is used for noise-handling in our model and as the final eval¬ 
uation metric. 
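The warping and normalisation can be sketched with a generic dynamic-programming routine that accepts any local cost function. The names are hypothetical. The example `waypoint_cost` is an illustrative stand-in and not the cost used in this work: it adds scaled translation, rotation, and gripper terms, omits the distance-based weights $w$, and assumes the gripper term fires when the two statuses differ.

```python
import math


def dtw_distance(a, b, cost):
    """Path-length-normalised dynamic time warping distance between two sequences."""
    m_a, m_b = len(a), len(b)
    D = [[0.0] * m_b for _ in range(m_a)]  # cumulative cost
    L = [[1] * m_b for _ in range(m_a)]    # matched pairs on the chosen path
    D[0][0] = cost(a[0], b[0])
    for i in range(1, m_a):                # first column
        D[i][0] = D[i - 1][0] + cost(a[i], b[0])
        L[i][0] = i + 1
    for j in range(1, m_b):                # first row
        D[0][j] = D[0][j - 1] + cost(a[0], b[j])
        L[0][j] = j + 1
    for i in range(1, m_a):
        for j in range(1, m_b):
            # Cheapest predecessor; ties are broken towards the shorter path.
            prev, steps = min((D[i - 1][j - 1], L[i - 1][j - 1]),
                              (D[i - 1][j], L[i - 1][j]),
                              (D[i][j - 1], L[i][j - 1]))
            D[i][j] = cost(a[i], b[j]) + prev
            L[i][j] = steps + 1
    return D[-1][-1] / L[-1][-1]


def waypoint_cost(u, v, alpha_t=1.0, alpha_r=1.0, beta=1.0):
    """Illustrative local cost for waypoints (gripper, translation, quaternion)."""
    (g_u, t_u, r_u), (g_v, t_v, r_v) = u, v
    d_t = math.sqrt(sum((x - y) ** 2 for x, y in zip(t_u, t_v)))
    dot = abs(sum(x * y for x, y in zip(r_u, r_v)))
    d_r = 2.0 * math.acos(min(1.0, dot))   # angle between unit quaternions
    d_g = 1.0 if g_u != g_v else 0.0       # assumed: penalise a status mismatch
    return d_t / alpha_t + d_r / alpha_r + beta * d_g
```

Because the cumulative cost is divided by the number of matched pairs, a long and a short demonstration of the same motion remain comparable.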


6 Robobarista: crowd-sourcing platform 

In order to collect a large number of manipulation demonstrations from the crowd, 
we built a crowd-sourcing web platform that we call Robobarista (see Fig. 4). It
provides a virtual environment where non-expert users can teach robots via a web 
browser, without expert guidance or physical presence with a robot and a target 
object. 

The system simulates a situation where the user encounters a previously unseen
target object and a natural language instruction manual for its manipulation. Within
the web browser, users are shown a point-cloud in the 3-D viewer on the left and
a manual on the right. A manual may involve several instructions, such as “Push
down and pull the handle to open the door”. The user’s goal is to demonstrate how
to manipulate the object in the scene for each instruction.

Jaeyong Sung, Seok Hyun Jin, and Ashutosh Saxena

[Fig. 5 panels; visible instruction fragments: "Pull the handle to squeeze", "Pull the Crispy Rice handle".]

Fig. 5: Examples from our dataset, each of which consists of a natural language instruction (top), an object part in
point-cloud representation (highlighted), and a manipulation trajectory (below) collected via Robobarista. Objects range
from kitchen appliances such as stove and rice cooker to urinals and sinks in restrooms. As our trajectories are collected
from non-experts, they vary in quality from being likely to complete the manipulation task successfully (left of dashed
line) to being unlikely to do so successfully (right of dashed line).

The user starts by selecting one of the instructions on the right to demonstrate
(Fig. 4). Once selected, the target object part is highlighted and the trajectory edit
bar appears below the 3-D viewer. Using the edit bar, which works like a video
editor, the user can play back and edit the demonstration. The trajectory representation
as a set of waypoints (Sec. 3.1) is directly shown on the edit bar. The bar shows
not only the set of waypoints (red/green) but also the interpolated waypoints (gray).
The user can click the ‘play’ button or hover the cursor over the edit bar to examine
the current demonstration. The blurred trail of the current trajectory (ghosted
demonstration) is also shown in the 3-D viewer to show its full expected path.

Generating a full trajectory from scratch can be difficult for non-experts. Thus,
similar to Forbes et al. [15], we provide a trajectory that the system has already seen
for another object as the initial starting trajectory to edit.¹

In order to simulate a realistic experience of manipulation, instead of simply
showing a static point-cloud, we have overlaid CAD models for parts such as ‘handle’
so that functional parts actually move as the user tries to manipulate the object.

A demonstration can be edited by: 1) modifying the position/orientation of a
waypoint, 2) adding/removing a waypoint, and 3) opening/closing the gripper. Once
a waypoint is selected, the PR2 gripper is shown with six directional arrows and
three rings. Arrows are used to modify position while rings are used to modify the
orientation. To add extra waypoints, the user can hover the cursor over an interpolated
(gray) waypoint on the edit bar and click the plus (+) button. To remove an
existing waypoint, the user can hover over it on the edit bar and click the minus (-)
button. As modification occurs, the edit bar and ghosted demonstration are updated
with a new interpolation. Finally, for editing the status (open/close) of the gripper,
the user can simply click on the gripper.

For broader accessibility, all functionality of Robobarista, including the 3-D viewer,
is built using JavaScript and WebGL.

1 We have made sure that it does not initialize with trajectories from other folds to keep 5-fold
cross-validation in the experiment section valid.

7 Experiments

Data. In order to test our model, we have collected a dataset of 116 point-clouds of
objects with 249 object parts (examples shown in Figure 5). There are also a total
of 250 natural language instructions (in 155 manuals). Using the crowd-sourcing
platform Robobarista, we collected 1225 trajectories for these objects from 71 non-expert
users on Amazon Mechanical Turk. After being shown a 20-second
instructional video, the user first completed a 2-minute tutorial task. At each session,
the user was asked to complete 10 assignments, each consisting of an object and
a manual to be followed.

For each object, we took raw RGB-D images with the Microsoft Kinect sensor
and stitched them using Kinect Fusion [22] to form a denser point-cloud in order
to incorporate different viewpoints of objects. Objects range from kitchen appliances
such as ‘stove’, ‘toaster’, and ‘rice cooker’ to ‘urinal’, ‘soap dispenser’, and
‘sink’ in restrooms. The dataset will be made available at
http://robobarista.cs.cornell.edu

Baselines. We compared our model against several baselines: 

1) Random Transfers (chance): Trajectories are selected at random from the set of 
trajectories in the training set. 

2) Object Part Classifier: To test our hypothesis that the intermediate step of classifying
the object part does not guarantee successful transfers, we built an object part classifier
using multiclass SVM [58] on point-cloud features including local shape features
[28], histogram of curvatures [42], and distribution of points. Once classified, the
nearest neighbor among the same object part class is selected for transfer.

3) Structured support vector machine (SSVM): It is a standard practice to hand-code
features for SSVM [59], which is solved with the cutting plane method [24]. We
used our loss function (Sec. 5) to train and experimented with many state-of-the-art
features.

4) Latent Structured SVM (LSSVM) + kinematic structure: The way an object is
manipulated depends on its internal structure, whether it has a revolute, prismatic,
or fixed joint. Borrowing from Sturm et al. [51], we encode joint type, center of the
joint, and axis of the joint as the latent variable $h \in \mathcal{H}$ in Latent SSVM [64].

5) Task-Similarity Transfers + random: It finds the most similar training task using
(p, l) and transfers any one of the trajectories from the most similar task. The pairwise
similarities between the test case and every task of the training examples are
computed by the average mutual point-wise distance of two point-clouds after ICP [6]
and the similarity in bag-of-words representations of language.

6) Task-similarity Transfers + weighting: The previous method is problematic when
non-expert demonstrations for the same task have varying qualities. Forbes et al.
[15] introduce a score function for weighting demonstrations based on weighted
distance to the “seed” (expert) demonstration. Adapting to our scenario of not having
any expert demonstration, we select the trajectory $\tau$ that has the lowest average distance
from all other demonstrations for the same task (similar to noise handling of Sec. 4).


2 Although not necessary for training our model, we also collected trajectories from the expert for
evaluation purposes.
7) Our model without Multi-modal Layer: This deep learning model concatenates
all the input of three modalities and learns three hidden layers before the final layer.

8) Our model without Noise Handling: Our model is trained without noise handling.
All of the trajectories collected from the crowd were trusted as ground-truth labels.

9) Our model with Experts: Our model is trained using trajectory demonstrations
from an expert, which were collected for evaluation purposes.
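
The selection rule of baseline 6 (pick the demonstration with the lowest average distance to all other demonstrations of the same task) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `select_medoid` is hypothetical, and the distance function is passed in as a parameter because the trajectory distance itself is defined elsewhere in the paper.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_medoid(demos: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return the demonstration with the lowest average distance to the others."""
    if not demos:
        raise ValueError("need at least one demonstration")
    if len(demos) == 1:
        return demos[0]

    def avg_distance(i: int) -> float:
        others = [dist(demos[i], demos[j]) for j in range(len(demos)) if j != i]
        return sum(others) / len(others)

    best = min(range(len(demos)), key=avg_distance)
    return demos[best]


# Toy usage: scalars stand in for trajectories, and absolute difference
# stands in for a trajectory distance. The outlier 5.0 is never selected.
chosen = select_medoid([1.0, 1.2, 0.9, 5.0, 1.1], lambda a, b: abs(a - b))
```

With a single noisy demonstration among several consistent ones, the noisy one has a large average distance to the rest and is therefore not chosen.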


7.1 Results and Discussions 


We evaluated all models on our dataset using 5-fold cross-validation, and the results are
in Table 1. Rows list the models we tested, including our model and baselines. Each
column shows one of three evaluations. The first two use dynamic time warping for
manipulation trajectory (DTW-MT) from Sec. 5. The first column shows averaged DTW-MT for
each instruction manual consisting of one or more language instructions. The second
column shows averaged DTW-MT for every test pair (p, l).

Table 1: Results on our dataset with 5-fold cross-validation. Rows list
models we tested including our model and baselines. Each column
shows a different metric used to evaluate the models.

                              per manual     per instruction
Models                        DTW-MT         DTW-MT         Accuracy (%)

chance                        28.0 (±0.8)    27.8 (±0.6)    11.2 (±1.0)

object part classifier        -              22.9 (±2.2)    23.3 (±5.1)
Structured SVM                21.0 (±1.6)    21.4 (±1.6)    26.9 (±2.6)
LSSVM + kinematic [51]        17.4 (±0.9)    17.5 (±1.6)    40.8 (±2.5)

similarity + random           14.4 (±1.5)    13.5 (±1.4)    49.4 (±3.9)
similarity + weights [15]     13.3 (±1.2)    12.5 (±1.2)    53.7 (±5.8)

Ours w/o Multi-modal          13.7 (±1.6)    13.3 (±1.6)    51.9 (±7.9)
Ours w/o Noise-handling       14.0 (±2.3)    13.7 (±2.1)    49.7 (±10.0)
Ours with Experts             12.5 (±1.5)    12.1 (±1.6)    53.1 (±7.6)

Our Model                     13.0 (±1.3)    12.2 (±1.1)    60.0 (±5.1)
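
For readers unfamiliar with dynamic time warping, the underlying alignment cost can be sketched as below. This is plain DTW over waypoint positions with a Euclidean point distance, shown only to convey the idea; it is not the DTW-MT measure itself, whose definition is given in the loss function section. The function names are illustrative.

```python
import math
from typing import Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Straight-line distance between two waypoint positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(traj_a: Sequence[Point], traj_b: Sequence[Point]) -> float:
    """Cumulative cost of the best monotonic alignment of two trajectories."""
    n, m = len(traj_a), len(traj_b)
    if n == 0 or m == 0:
        raise ValueError("trajectories must be non-empty")
    # cost[i][j] = best cost of aligning the first i points of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of b
                                 cost[i][j - 1],      # repeat a point of a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because the alignment may repeat waypoints, two trajectories that trace the same path with different numbers of waypoints receive a cost of zero, which is the property that makes a warping-based measure suitable for comparing demonstrations of different lengths.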

As DTW-MT values are not intuitive, we added the extra column “accuracy”,
which shows the percentage of transferred trajectories with a DTW-MT value less
than 10. Through expert surveys, we found that when the DTW-MT of a manipulation
trajectory is less than 10, the robot has come up with a reasonable trajectory and will
very likely be able to accomplish the given task.
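
The accuracy column follows directly from the threshold stated above. A minimal sketch, with a hypothetical function name and made-up DTW-MT values used purely for illustration:

```python
from typing import Sequence

DTW_MT_THRESHOLD = 10.0  # threshold from the text: below 10 counts as a success


def accuracy_percent(dtw_mt_values: Sequence[float],
                     threshold: float = DTW_MT_THRESHOLD) -> float:
    """Percentage of transferred trajectories with DTW-MT strictly below the threshold."""
    if not dtw_mt_values:
        raise ValueError("need at least one DTW-MT value")
    hits = sum(1 for v in dtw_mt_values if v < threshold)
    return 100.0 * hits / len(dtw_mt_values)


# Illustrative values only (not from the dataset): two of four fall below 10.
example = accuracy_percent([4.0, 9.9, 10.0, 25.0])
```

Note that a value exactly equal to the threshold is not counted, matching the "less than 10" wording.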

Can manipulation trajectory be transferred from completely different objects? 

Our full model achieved 60.0% accuracy (Table 1), outperforming chance
as well as the other baseline algorithms we tested on our dataset.

Fig. 6 shows two examples of successful transfers and one unsuccessful transfer
by our model. In the first example, the trajectory for pulling down on a cereal
dispenser is transferred to a coffee dispenser. Because our approach to trajectory
representation is based on the principal axis (Sec. 3.2), even though the cereal and
coffee dispenser handles are located and oriented differently, the transfer is a success.
The second example shows a successful transfer from a DC power supply to a slow
cooker, which have “knobs” of similar shape. The transfer was successful despite
the difference in instructions (“Turn the switch..” and “Rotate the knob..”) and object
type.

The last example of Fig. 6 shows an unsuccessful transfer. Despite the similarity
of the two instructions, the transfer was unsuccessful because the grinder’s knob was facing
towards the front and the speaker’s knob was facing upwards. We fixed the z-axis
along gravity because point-clouds are noisy and gravity can affect some manipulation
tasks, but a more reliable method for finding the object coordinate frame and a
better 3-D sensor should allow for more accurate transfers.
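
To make the failure mode concrete, the following sketch builds an object frame whose z-axis is fixed to the vertical and whose x-axis is the principal axis of the part's points projected onto the horizontal plane. It is only an illustration under stated assumptions: the input cloud is assumed to be already expressed with z pointing up, the function name is hypothetical, and the actual frame construction of Sec. 3.2 is not reproduced here.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def gravity_aligned_frame(points: Sequence[Vec3]) -> Tuple[Vec3, Vec3, Vec3, Vec3]:
    """Return (origin, x_axis, y_axis, z_axis) for a part's point-cloud.

    z is fixed to the vertical; x is the direction of largest horizontal
    variance (2-D principal axis); y completes a right-handed frame.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    # Entries of the horizontal (x, y) scatter matrix.
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Closed-form orientation of the major axis of a 2x2 symmetric matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    x_axis = (math.cos(theta), math.sin(theta), 0.0)
    y_axis = (-math.sin(theta), math.cos(theta), 0.0)
    z_axis = (0.0, 0.0, 1.0)
    return (cx, cy, cz), x_axis, y_axis, z_axis
```

In such a frame, a part rotated about the vertical is handled by the change in the x-axis, but a knob facing forward and a knob facing upward differ by a rotation about a horizontal axis, which a frame with a fixed z-axis cannot absorb.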
[Figure 6 panels. Successful Transfers: “Pull the Colossal Crunch handle to dispense.” to
“Pull down on the right handle to dispense the coffee.”; “Turn the switch clockwise to switch
on the power supply” to “Rotate the knob clockwise to turn slow cooker on.” Unsuccessful
Transfer: “Rotate the knob clockwise to start grinding.” to “Rotate the speaker knob clockwise
until it clicks.”]


Fig. 6: Examples of successful and unsuccessful transfers of manipulation trajectory from left to right using our
model. In the first two examples, though the robot has never seen the ‘coffee dispenser’ and ‘slow cooker’ before, the robot
has correctly identified that the trajectories of ‘cereal dispenser’ and ‘DC power supply’, respectively, can be used to
manipulate them.



Fig. 7: Examples of transferred trajectories being executed on PR2. On the left, PR2 is able to rotate the ‘knob’ to 
turn the lamp on. On the right, using two transferred trajectories, PR2 is able to hold the cup below the ‘nozzle’ and 
press the ‘lever’ of ‘coffee dispenser’. 


Does it ensure that the object is actually correctly manipulated? We do not claim
that our model can find and execute manipulation trajectories for all objects. However,
for a large fraction of objects which the robot has never seen before, our model
outperforms other models in finding correct manipulation trajectories. The contribution
of this work is in the novel approach to manipulation planning which enables
robots to manipulate objects they have never seen before. For some of the objects,
correctly executing a transferred manipulation trajectory may require incorporating
visual and force feedback [62, 60] in order for the execution to adapt exactly to the
object as well as find a collision-free path [50].

Can we crowd-source the teaching of manipulation trajectories? When we
trained our full model with expert demonstrations, which were collected for evaluation
purposes, it performed at 53.1% compared to 60.0% by our model trained
with crowd-sourced data. Even with the significant noise in the labels, as shown in the
last two examples of Fig. 5, we believe that our model with crowd demonstrations
performed better because our model can handle noise and because deep learning
benefits from having a larger amount of data. Also note that all of our crowd users
are real non-expert users from Amazon Mechanical Turk.

Is segmentation required for the system? In the vision community, even with the
state-of-the-art techniques mm, detection of ‘manipulatable’ object parts such
as ‘handle’ and ‘lever’ in a point-cloud is by itself a challenging problem im. Thus,
we rely on a human expert to pre-label the parts of the object to be manipulated. The
point-cloud of the scene is over-segmented into thousands of supervoxels, from which
the expert chooses the part of the object to be manipulated. Even with the input
of the expert, segmented point-clouds are still extremely noisy because of the poor
performance of the sensor on object parts with glossy surfaces.












Fig. 8: Visualization of a sample of learned high-level features (two nodes) at the last hidden layer $h^3$. The point-cloud in
the picture is given an arbitrary axis-based color for visualization purposes. The left shows node #1 at layer $h^3$ that learned
to (“turn”, “knob”, “clockwise”) along with the relevant point-cloud and trajectory. The right shows node #51 at layer $h^3$
that learned to “pull” a handle. The visualization is created by selecting a set of words, a point-cloud, and a trajectory that
maximize the activation at each layer and passing the highest activated set of inputs to the higher level.


Is intermediate object part labeling necessary? The Object Part Classifier performed
at 23.3%, even though the multiclass SVM for finding the object part label
achieved over 70% accuracy in five major classes of object parts (‘button’, ‘knob’,
‘handle’, ‘nozzle’, ‘lever’) among 13 classes. Finding the part label is not sufficient
for finding a good manipulation trajectory because of large variations within each part class. Thus, our
model, which does not need part labels, outperforms the Object Part Classifier.

Can features be hand-coded? What kinds of features did the network learn? 
For both the SSVM and LSSVM models, we experimented with several state-of-the-art
features for many months, and the better of the two (LSSVM) gave 40.8%. The task similarity
method gave a better result of 53.7%, but it requires access to all of the raw training data (all
point-clouds and language) at test time, which leads to heavy computation at test
time and requires more storage as the size of the training data increases.

While it is extremely difficult to find a good set of features for three modalities,
our deep learning model, which does not require hand-designing of features, learned
features at the top layer $h^3$ such as those shown in Fig. 8. The left shows a node
that correctly associated point-cloud (axis-based coloring), trajectory, and language
for the motion of turning a knob clockwise. The right shows a node that correctly
associated them for the motion of pulling the handle.

Also, as shown for two other baselines using deep learning, when modalities
were simply concatenated, it gave 51.9%, and when noisy labels were not handled,
it gave only 49.7%. Both results show that our model can handle noise from
crowd-sourcing while learning relations between three modalities.

7.2 Robotic Experiments 

As the PR2 robot stands in front of the object, the robot is given a natural language
instruction and a segmented point-cloud. Using our algorithm, manipulation trajectories
to be transferred were found for the given point-clouds and languages. Given
the trajectories, which are defined as a set of waypoints, the robot followed the
trajectory using an impedance controller (ee_cart_imped) QD. Some examples of
successful execution on the PR2 robot are shown in Figure 7 and in the video at the project
website: http://robobarista.cs.cornell.edu
8 Conclusion

In this work, we introduced a novel approach to predicting manipulation trajectories
via part-based transfer, which allowed robots to successfully manipulate objects
they have never seen before. We formulated it as a structured-output problem and
presented a deep learning model capable of handling three completely different modalities
of point-cloud, language, and trajectory while dealing with large noise in the
manipulation demonstrations. We also designed a crowd-sourcing platform, Robobarista,
that allowed non-expert users to easily give manipulation demonstrations over
the web. Our deep learning model was evaluated against many baselines on a large
dataset of 249 object parts with 1225 crowd-sourced demonstrations. In future work,
we plan to share the learned model using the knowledge-engine, RoboBrain El.